Skip to content

Conversation

@komer3
Copy link
Contributor

@komer3 komer3 commented Feb 4, 2026

Summary
This PR implements a complete, concurrency‑safe LKE node pool scale‑up flow for Karpenter. It enforces a strict 1:1 NodeClaim→Linode mapping, adds retryable error sentinels to distinguish transient failures, and brings Standard and Enterprise LKE tiers to parity. It also expands the fake Linode API to support more realistic tag filtering and node pool operations, adds LKE‑specific GC tests, and documents the full design.

What changed (by area)

LKE Provider

  • pkg/providers/lke/lke.go
    • Refactors Create() into smaller helpers (resolve instance type, lookup existing instance, pool locking).
    • Adds explicit retryable errors (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed, ErrNodesProvisioning) and a bounded retry loop.
    • Introduces pool‑scoped keyed mutex to serialize claim/scale operations per (nodepool, instanceType).
    • Implements deterministic claim selection, tag verification, and pool scale‑up logic.
    • Handles Standard vs Enterprise pool discovery paths explicitly.

Concurrency / Architecture

  • pkg/utils/keyedmutex.go
    • New keyed mutex to guard pool‑scoped operations (pool lookup/create, claim/scale, orphan GC).

Fake Linode API

  • pkg/fake/linodeapi.go
    • Adds support for Linode list filter JSON with tag equality + contains.
    • Adds UpdateInstance + DeleteLKENodePoolNode behaviors.
    • Ensures LKE cluster ID + tags are propagated to fake instances.
    • Improves parity with real API to support LKE tests.

Linode SDK Interface

  • pkg/linode/sdk.go
    • Extends Linode API interface with UpdateInstance + DeleteLKENodePoolNode.

Garbage Collection

  • pkg/controllers/nodeclaim/garbagecollection/suite_test.go
    • Adds LKE Standard + Enterprise GC tests:
      • orphaned node deletion
      • multi‑node pool deletion behavior
      • owned node protection
      • resolution window coverage

Docs + Examples

  • docs/lke-nodepool-scaleup.md
    • New design doc describing invariants, claim‑or‑scale flow, tier differences, and rate‑limit analysis.
  • examples/v1/simple.yaml
    • Sets consolidateAfter: 30s in default NodePool example.
  • Makefile
    • Adds golangci‑lint tooling to verify.

Architectural Notes

  • Pool‑scoped locking ensures no double‑claim or conflicting pool scale operations.
  • Sentinel error types make retryable vs fatal Create errors explicit and testable.
  • Enterprise tier relies on Linode instance list with auto‑tags; Standard relies on pool Linodes + per‑instance tag inspection.

Tests

  • make test
  • make verify

Why this matters
This makes LKE mode robust for existing pool scale‑up, reduces race conditions, and improves observability and test coverage across both LKE tiers.

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 68.79845% with 161 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.04%. Comparing base (7f8beb4) to head (82aaeb8).

Files with missing lines Patch % Lines
pkg/providers/lke/lke.go 66.66% 40 Missing and 51 partials ⚠️
pkg/fake/linodeapi.go 68.11% 24 Missing and 20 partials ⚠️
pkg/controllers/nodeclass/controller.go 28.57% 2 Missing and 3 partials ⚠️
pkg/cloudprovider/cloudprovider.go 63.63% 0 Missing and 4 partials ⚠️
pkg/controllers/nodeclass/validation.go 50.00% 0 Missing and 3 partials ⚠️
pkg/cloudprovider/drift.go 66.66% 0 Missing and 2 partials ⚠️
pkg/controllers/controllers.go 0.00% 2 Missing ⚠️
pkg/operator/options/options.go 60.00% 2 Missing ⚠️
pkg/providers/instance/instance.go 71.42% 1 Missing and 1 partial ⚠️
pkg/utils/keyedmutex.go 84.61% 1 Missing and 1 partial ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   74.19%   67.04%   -7.16%     
==========================================
  Files          37       38       +1     
  Lines        2279     2640     +361     
==========================================
+ Hits         1691     1770      +79     
- Misses        452      552     +100     
- Partials      136      318     +182     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@komer3 komer3 changed the title [WIP] Scale existing lkenodepools [feat][improvement] new arch/implementation for scaling existing nodepools Feb 4, 2026
@komer3 komer3 requested a review from Copilot February 4, 2026 20:41
@komer3 komer3 marked this pull request as ready for review February 4, 2026 20:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the LKE provider to support concurrency-safe multi-node pool scaling with strict 1:1 NodeClaim→Linode mapping. Key changes include replacing pool-level tags with instance-level tags for claim ownership, implementing pool-scoped locking, adding retryable error types, and expanding test coverage for both Standard and Enterprise LKE tiers.

Changes:

  • Replaces pool-level NodeClaim tags with instance-level claim tags using UpdateInstance API
  • Introduces keyed mutex for pool-scoped concurrency control and claim/scale operations
  • Adds explicit retryable error sentinels (ErrNoClaimableInstance, ErrClaimFailed, etc.) and bounded retry logic
  • Expands fake Linode API to support tag filtering, UpdateInstance, and DeleteLKENodePoolNode

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
pkg/providers/lke/lke.go Core LKE provider refactor: claim-or-scale flow, instance-level tagging, pool locking, tier-specific logic
pkg/utils/keyedmutex.go New keyed mutex for pool-scoped synchronization
pkg/utils/utils.go Removed custom Filter type, added GetInstanceTagsForLKE, GetTagValue, switched to linodego.Filter
pkg/fake/linodeapi.go Enhanced fake API: tag filtering, UpdateInstance, DeleteLKENodePoolNode support
pkg/providers/lke/suite_test.go Expanded LKE tests for Standard/Enterprise tiers: idempotency, claiming, scaling, error paths
pkg/controllers/nodeclaim/garbagecollection/suite_test.go Added LKE-specific GC tests for both tiers
pkg/providers/instance/types.go Removed pool-specific fields from Instance type
pkg/linode/sdk.go Extended LinodeAPI interface with UpdateInstance, DeleteLKENodePoolNode
pkg/operator/operator.go Updated to pass cluster name to LKE provider and use linodego.Filter
docs/lke-nodepool-scaleup.md New design doc detailing architecture, invariants, flow, and API call analysis
Comments suppressed due to low confidence (1)

pkg/providers/lke/suite_test.go:1

  • Corrected spelling of 'malformatedtag' to 'malformedtag'.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

pkg/providers/lke/suite_test.go:1

  • Corrected field name from 'ClusterName' to 'ClusterID' to match the actual parameter type.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add design document describing the core model, invariants, and flow for scaling existing LKE node pools in LKE mode. Covers 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs standard), concurrency model with keyed mutex, claim-or-scale loop with 15s timeout, and error handling policy.
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
This commit introduces comprehensive LKE NodePool scale-up functionality with proper concurrency control and extensive test coverage.

Major Changes:
- Add LKE NodePool scale-up capability to support existing NodePool expansion
- Implement KeyedMutex utility for per-key synchronization to prevent race conditions
- Enhance LKE provider with cluster name tracking and operation timeouts
- Add comprehensive test suites for garbage collection and LKE operations
Remove the `v1` import alias for the `v1alpha1` package throughout the codebase. Use the full `v1alpha1` package name directly to avoid confusion with the Karpenter v1 API. Update all references in cloudprovider, controllers, drift detection, and utility packages. Also fix incorrect finalizer constant reference from `v1.TerminationFinalizer` to `karpv1.TerminationFinalizer` in nodeclass controller.
Add explicit `ErrNodesProvisioning` error to handle nodes with zero InstanceID in standard tier. Check all nodes before returning error to prioritize finding claimable instances. Add 500ms sleep and retry when nodes are still provisioning. Remove redundant type declaration in fake client.
Extract LKE pool lookup into `findLKENodePoolFromLinodeInstanceID` helper to reduce duplication in Delete method. Remove redundant instance fetching, tag parsing, and pool mutex locking from Delete - all now handled by the helper. Use `findNodeInPool` helper for node lookup instead of inline loop. Update simple example to add 30s consolidation period. Reduce default Karpenter replicas from 2 to 1 in chart values.
…tag in LKE mode

Replace custom `lke-pool-id` tag with Linode's automatic `nodepool=<id>` tag for pool identification in enterprise tier. Use `LKEClusterID` field on instances for cluster association instead of tags. Update instance listing to filter by cluster ID field. Simplify pool resolution logic by using native Linode tags. Update fake client to set `LKEClusterID` on created instances. Remove pool ID tag from instance claim flow.

Fixed major issues with LKE enterprise creation flow (no more duplicate nodes being created due to API timing). Also fixed List() behaviour that was throwing a lot of errors.
komer3 and others added 8 commits February 5, 2026 15:24
- Instance model cleanup: Remove NodeID, Labels, Taints, PoolID from Instance struct; simplify NewLKEInstance constructor. Reduces per-instance memory and eliminates unused pool-derived fields. These field were not being used at all by the instanceToNodeClaim() consumer.

- API call reduction: Eliminate pool lookups in Get/List/hydrateInstanceFromLinode; use direct instance data. Cuts ListInstances + GetLKENodePool calls per operation. Now we just do 1 API call.

- Enterprise test parity: Add comprehensive tests for enterprise-tier garbage collection with proper tag-based discovery and provider setup. Did the same comprehensive parity test case additions for lke provider tests.

- Test suite restructuring: Reorganize LKE suite tests under explicit Standard/Enterprise tier contexts with consistent subcontexts (Create, Idempotency, Multi-node scaling, Tag verification, Error paths, Create options, Get, List, Delete, CreateTags). Remove duplicate top-level Create options and pool filtering tests.

- Documentation: Update LKE exisiting nodepool scaleup docs with clarified tier behaviours, API call budgets, and scalability notes (more info on potential rate limit we can hit).
…helper extraction. Reduced the linter complexity.

Extract instance type resolution, existing instance lookup, and pool locking into separate helper methods. Introduce explicit error types (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed) to distinguish retry-able conditions from fatal errors. Replace inline pool mutex lock/unlock with withPoolLock helper for cleaner resource management. Simplify attemptCreate by removing nodeID return value and using instance
Add ProviderConfig struct to LKE provider with CreateDeadline, TagVerificationTimeout, and RetryDelay fields. Add corresponding CLI flags and environment variables (LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, LKE_RETRY_DELAY) with defaults matching previous hardcoded values (10s, 4s, 2s). Thread config through operator initialization and test setup. Extract retry error check into isRetryableCreateError helper. Replace hardcoded sleep
Add lkeCreateDeadline, lkeTagVerificationTimeout, and lkeRetryDelay settings to chart values with defaults (10s, 4s, 2s). Wire settings through deployment template as LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, and LKE_RETRY_DELAY environment variables. Add documentation comments for each timeout parameter.
Add retryable error checks to Create and verifyTagsApplied loops using utils.IsRetryableError. Distinguish between retryable and fatal errors in verifyTagsApplied by returning wrapped error for non-retryable cases. Update Get method to use context parameter instead of Background() and distinguish NotFound errors from other API failures. Expand IsRetryableError to include timeout and EOF errors (copied from CAPL)
Add design document describing the core model, invariants, and flow for scaling existing LKE node pools in LKE mode. Covers 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs standard), concurrency model with keyed mutex, claim-or-scale loop with 15s timeout, and error handling policy.
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
Replace colon (':') with equals sign ('=') as the delimiter in tag formatting throughout LKE provider implementation and tests. Update tag creation in instance provider, tag verification logic, pool tag generation, and all test assertions to use 'key=value' format instead of 'key:value' format.
Remove custom KeyedMutex implementation from pkg/utils and replace with keymutex.KeyMutex from k8s.io/utils/keymutex. Update LKE provider to use keymutex.NewHashed(0) and LockKey/UnlockKey methods instead of custom Lock/Unlock methods.
Copy link
Contributor

@AshleyDumaine AshleyDumaine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested locally since Github is partially on fire.

@AshleyDumaine AshleyDumaine merged commit 3a75db8 into main Feb 9, 2026
@AshleyDumaine AshleyDumaine deleted the scale-existing-lkenodepools branch February 9, 2026 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants